Introduction

CytoSignal detects cell-cell signaling from spatial transcriptomics data at single-cell resolution. CytoSignal performs a nonparametric statistical test to identify which cells within a tissue have significant activity for a particular signaling interaction. CytoSignal considers multi-component interactions and separately models interactions mediated by diffusion vs. contact-dependent molecules.

Example data

In this tutorial, we demonstrate how to preform CytoSignal on a mouse embryonic E14 brain data captured by Slide-tags, a spatial transcriptomic protocol at single-cell resolution. This data was originally produced in the work of Andrew J. C. Russell et al., 2023, and is now publicly available on Single-Cell Portal via SCP2170. For convenience, we have pre-processed the data so it can be easily loaded into user R environment, which is available at figshare.

Download the files listed above from the provided link into a desired local directory.

Create CytoSignal object

Loading example data

Next, load the downloaded data into the R environment using the following codes. We assume that users have the files placed at the current working directory (as can be shown by getwd()). Alternatively, users can also specify the path to the files.

## The RDS file will be loaded into a ready-to-use object
dge <- readRDS("SCP2170_annotated_dgCMatrix.rds")

## The cluster annotation need to be presented as a factor object
cluster <- read.csv("SCP2170_cluster.csv")
cluster <- factor(cluster$cell_type)
names(cluster) <- colnames(dge)

## The spatial coordinates need to be presented as a matrix object
spatial <- as.matrix(read.csv("SCP2170_spatial.csv", row.names = 1))
## Please make sure that the dimension names are lower case "x" and "y"
colnames(spatial) <- c("x", "y")

Then a CytoSignal object can be created with the following command.

library(cytosignal)

cs <- createCytoSignal(raw.data = dge, cells.loc = spatial, clusters = cluster)

Adding Ligand-receptor Database

CytoSignal has a built-in ligand-receptor interaction database resorted from CellphoneDB, which can be simply loaded into the object running the following command.

cs <- addIntrDB(cs, g_to_u, db.diff, db.cont, inter.index)

Preprocessing

Next, remove low-quality cells and genes, retain only those genes available in the interaction database, and convert gene names to Uniprot IDs.

cs <- removeLowQuality(cs, counts.thresh = 300)
cs <- changeUniprot(cs)

Inferring spatilly resolved significant Ligand-receptor interation activities

The spatially resolved interaction scores of each interaction in each location (LRscore) is defined as the co-expression of ligand and receptor genes within close spatial proximity. The computation can be divided into three main steps: 1) defining spatial neighbors for each location; 2) calculating the amount of ligand (\(L\)) and receptor (\(R\)) each location can receive from their spatial neighbors; 3) calculating ligand-receptor co-expression within each spatial neighborhood; 4) performing spatial permutation test to infer significant interactions.

Defining spatial neighborhoods

CytoSignal defines spatial neighborhoods for diffusion-dependent and contact-dependent interactions differently. We use the Epsilon ball approach for diffusible interactions and Delaunay Triangulation for contact-dependent interactions.

For diffusion-dependent interactions, for each location \(i\), we define its spatial neighbors as all locations \(j\) within a circle centered on location \(i\) with a predefined radius \(r\) (200 µm by default). We next weight the amount of \(L\) that \(i\) receives based on the physical distance between \(i\) and \(j\) transformed by a Gaussian kernel. For determining the parameters of this kernel, we will need a scaling factor between the arbitrary units of the spatial coordinates and real unit (such as µm), which is based on prior knowledge of user dataset. For the specific example Slide-tags data, the conversion ratio is 0.73.

cs <- inferEpsParams(cs, scale.factor = 0.73)

It’s recommended to review the inferred parameters for a sanity check.

cs@parameters$r.diffuse.scale
## [1] 273.9726
cs@parameters$sigma.scale
## [1] 73.70953

For each location, we can then use findNN() to find its spatial neighborhood and then calculate the weights between each it and its neighbors. The results for diffusible interactions are saved with model name GauEps and contact-dependent interactions are saved with model name DT.

cs <- findNN(cs)

Inferring ligand and receptor expression in neighborhood

The next step is to calculate the amount of ligand (\(L\)) and receptor (\(R\)) each location can receive from their spatial neighbors using function imputeLR.

cs <- imputeLR(cs)

Calculating ligand-receptor co-expression and identifying spatially significant interactions

For calculating the LRscore of each interaction within each spatial neighborhood, we multiply \(L\) and \(R\) within each location and apply an average within its DT neighborhood. Next, we perform a spatial permutation test, calculate the null distribution of the imputed ligands and receptors, and calculate a one-sided p-value. To control for multiple hypothesis testing and potential biases caused by cellular density differences, we further perform spatial false discovery rate correction.

The output of CytoSignal is a test statistic \(S\) and adjusted p-value for each signaling interaction within each location in the tissue. Finally, cells with significant signaling activity can be identified by setting a significance level such as p.value = 0.05. For convenience, we rank interactions by either the number of significant spatial positions or the spatial variability statistics of the LRscore calculated by SPARK-X.

The function inferIntrScore is an integrated function that performs the LRscore calculation and permutation tests. For the calculation of LRscore, by default, receptors are taken from the normalized expression. Ligands of diffusible interactions are imputed with Gaussian Epsilon ball (GauEps-Raw), while those from contact-dependent interactions are imputed with Delaunay Triangulation (DT-Raw). The function also provides an option recep.smooth to smooth the receptor expressions with Delaunay Triangulation (GauEps-DT and DT-DT), which is useful when the sparsity of the data is relatively high. The function inferSignif() should be called subsequently to identify significant interactions. For ranking the interactions by spatial variability statistics, we provide rankIntrSpatialVar().

cs <- inferIntrScore(cs)
cs <- inferSignif(cs, p.value = 0.05, reads.thresh = 100, sig.thresh = 100)
cs <- rankIntrSpatialVar(cs)

reads.thresh is the minimum number of reads for a ligand-receptor interaction to be considered. sig.thresh is the minimum number of beads for a ligand-receptor interaction to be considered. The three thresholds can be changed arbitrarily if the number of the significant beads is too large or too small. The function inferSignif() by default is applied to all LRscore types available. When any thresholds need to be tweaked for a specific LRscore type, users can specify the slot.use argument explicitly.

Users can use function showIntr() to view all significant interactions.

The argument slot.use is for specifying the LRscore calculation model used. Possible options are explained as followed:

  • "GauEps-Raw": Ligand expressions are imputed with Gaussian Epsilon ball and receptor expressions are taken from normalized expression. Typically for the diffusion-dependent interactions.
  • "DT-Raw": Ligand expressions are imputed with Delaunay Triangulation and receptor values are taken from raw expression. Typically for the contact-dependent interactions.
  • "GauEps-DT": Ligand expressions are imputed with Gaussian Epsilon ball and receptor values are imputed with Delaunay Triangulation. This is for diffusion-dependent interactions, but exists only when inferIntrScore() is run with recep.smooth = TRUE.
  • "DT-DT": Ligand values and receptor values are all imputed with Delaunay Triangulation. This is for contact-dependent interactions, but exists only when inferIntrScore() is run with recep.smooth = TRUE.

The argument signif.use can return interactions that are ranked by different metrics:

  • "result": have p-value less that specified threshold.
  • "result.hq": have significant p-value and passes the quality control on the minimum number of reads and beads as mentioned above.
  • "result.spx": are spatially variable and are of high quality according to that in "result.hq".

Setting return.name = TRUE displays both interaction unique IDs and the interaction names for easier interpretation. Interaction names are shown as a “ligand-receptor” gene symbol.

allIntrs <- showIntr(cs, slot.use = "GauEps-Raw", signif.use = "result.spx", return.name = TRUE)
print(head(allIntrs))
## CPI-SS0659DBE72 CPI-SS04124F4E1 CPI-SS0008137B6 CPI-SS00F4DDF4B CPI-SS0E063192D 
##   "EFNA4-EPHA4"     "EPO-EFNB2"   "EFNA4-EPHA5"     "PTN-PTPRS"    "NRG2-ERBB4" 
## CPI-SS0C694CB44 
##   "VEGFA-NMDE2"

Visualization

CytoSignal makes cellular-resolution, spatially-resolved signaling inference. We developed several new visualizations for plotting each inferred signaling interaction.

Most of the functions take two arguments, intr and slot.use, for specifying individual interactions. Users can first show a list of significant interactions that is available for visualization with showIntr(). intr should then be the unique ID of available interaction(s) and slot.use should be a selection as explained above.

Visualizing significant cell-cell communication in 3D

We developed a 3D edge plot for visualizing the signal-sending and signal-receiving cells of an interaction at single-cell resolution. The plot comes with two layers of scatter plot of cells placed at the top and bottom of a 3D box space, for labeling the receivers and senders, respectively. In each layer, we use low transparency to highlight the cells that is sending or receiving the signal of the specified interaction. Cells are colored by cluster at the mean time. Finally, we bring lines that connect the senders with their corresponding tentative receivers, which are the edges.

intr.use <- names(allIntrs)[1]
plotEdge(cs, intr.use, slot.use = "GauEps-Raw", pt.size = 0.15)

To plot the cluster annotation legend alone, users can simply use plotCluster() for reference.

plotCluster(cs)

Combination plots of 3D edge plot, gene expression, LRscore and cluster annotation

We provide a function named plotSignif2 for making a general combination figure. This function plots on a per-interaction basis, and for each interaction it shows 1) the imputed gene expression of the ligand and receptor; 2) the original raw expression values of the ligand and receptor; 3) the inferred LRscore; 4) cluster annotations for each location; 5) 3D edge plot for visualizing the signal-sending and signal-receiving cells. General graphical setting can be found in the documentation of the function (?plotSignif2).

plotSignif2(cs, intr = intr.use, slot.use = "GauEps-Raw", return.plot = TRUE, edge = TRUE, pt.size = 0.2)

Recommended: When plotting a large number of interactions, it’s recommended to use numeric index with intr and turn to return.plot = FALSE (by default). In this way, plots of all input interactions at high resolution will be stored on the device instead of on the screen.

The codes below show the top 5 significant interactions ranked by SPARK-X for both diffusion-dependent and contact-dependent interactions.

# For diffusion dependent interactions
signif.use <- "result.spx"
lrscore.slot <- "GauEps-Raw"
plot_dir <- paste0("path_to_result/diff-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)

# For contact dependent interactions
## Choose a level of significance metric
lrscore.slot <- "DT-Raw"
plot_dir <- paste0("path_to_result/cont-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)

Downstream Analysis

We provide a few downstream analysis approaches to further investigate the biological significance from interactions identified by CytoSignal.

Regression analysis on differentially expressed genes

With function inferIntrDEG(), we can identify genes that are differentially expressed in the cells that are highly correlated to an interaction. After obtaining a gene set for each interaction, we further fit it to a glmnet model to do a regression analysis. We finally examine the coefficients of the model to determine the significant genes for the interaction.

The returned object is a list where each element is a result list for each interaction. The sub-entry intrDEG[[intr.use]]$sign_genes is the character vector containing the genes that are significant for the interaction.

intrDEG <- inferIntrDEG(cs, intr = intr.use, slot.use = "GauEps-Raw", signif.use = "result.spx")
intrDEG[[1]]$sign_genes
##  [1] "Gm29260"       "Erbb4"         "Epha4"         "Zeb2"         
##  [5] "Neb"           "Mpped2"        "Meis2"         "Efna4"        
##  [9] "Mob3b"         "Mki67"         "Gm45680"       "B020031H02Rik"
## [13] "Sox1ot"        "Hmgb2"         "Nfix"          "Cdon"         
## [17] "Gm30373"       "Nrxn3"         "Gli3"          "Gas1"         
## [21] "Dleu2"         "Gm27017"       "Celsr1"        "Zbtb20"       
## [25] "Tcf4"          "Gldc"          "Emx2os"        "Sytl5"

GO term enrichment analysis

With the gene set, we can further do a GO term enrichment analysis. Technically this can be done with any GO tools of preference. We used GORILLA in our manuscript. Here we provide an example of using the R package gprofiler2.

# Need to set `evcodes = TRUE` to get list genes that hits each GO term
goRes <- gprofiler2::gost(intrDEG[[1]]$sign_genes,
                          organism = "mmusculus", evcodes = TRUE)

We provide one more function heatmap_GO() for showing the the GO term by the genes submitted. The heatmap will be colored by the coefficient value that is calculated from the regression analysis, while white-colored grids indicate none intersection between the corresponding GO terms.

With GO analysis conducted with other tools, we require at least three columns from the result table for this function to work:

  • A column for the GO term names, i.e. the short text description. The name of this column need to be set to argument description.col and the content will be shown as the row labels of the heatmap.
  • A column for the p-values of the GO terms, for ranking the terms. The name of this column need to be set to argument pval.col.
  • A column for the gene intersection evidence. Should have some thing like "gene1, gene2, gene3" so we can identify the hit. The name of this column need to be set to argument gene.col. This information could have been formatted variously by different tools, users might need to further provide a splitting function that splits the string into a character vector of the gene names to argument gene.split.fun.
heatmap_GO(intrDEG, goRes$result, intr = intr.use, 
           description.col = "term_name", pval.col = "p_value", 
           gene.col = "intersection")

In our manuscript, we used another web-based tool REVIGO to visualize the GO terms. Users can simply copy the term IDs from the result into the web form and view the visualization interactively in browser. For users with limited access to interactive session (e.g. on HPCs), we provided function revigo() to programmatically submit query and fetch the result from their server. The queried result allows us generating a scatter plot where each point represents a GO term and more semantically similar GO terms are also closer in the plot. The color scale presents the value when we submit the query, which is the log transformed p-value in this example. The size of dots indicates the frequency of the GO term in the underlying GOA database, so that bubbles of more general terms are larger.

revigoRes <- revigo(GOterms = goRes$result$term_id, 
                    value = goRes$result$p_value, valueType = "PValue",
                    speciesTaxon = "10090")
plotREVIGO(revigoRes, labelSize = 3)